Copy number variation across the WatSeq dataset

Ricardo H. Ramirez-Gonzalez

23 Sept 2020

Detection of medium and large CNV

Objective

  • To find medium and large copy number variations (CNVs)

Initial data

Window        Sample.1  Sample.2  Sample.3  Sample.4
chr1:1:100          19        24        12        14
chr2:101:200        25        23        17        18
chr1:201:300        20        27        12        18
chr2:301:400        19        25        16        10
chr2:401:500        17        27        14        25

Initial data

  • Coverage is not uniform across regions

Normalisation

\(x_{i,j}=\frac{WindowCoverage_{i,j}\times10^{9}}{WindowLength_{i}\times{totalReadsSample_{j}}}\)

\(xnorm_{i,j}=\frac{x_{i,j}}{mean(X_{i})}\)

Normalisation by sample

\(x_{i,j}=\frac{WindowCoverage_{i,j}\times10^{9}}{WindowLength_{i}\times{totalReadsSample_{j}}}\)

Window            Sample.1  Sample.2  Sample.3  Sample.4
chr1:1:100              19        24        12        14
chr2:101:200            25        23        17        18
chr1:201:300            20        27        12        18
chr2:301:400            19        25        16        10
chr2:401:500            17        27        14        25
totalReadsSample       100       126        71        85
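
The per-sample normalisation can be sketched in Python as follows. This is a minimal illustration, not the pipeline's own code: the function names are hypothetical, totalReadsSample is taken as the column sum of the toy table, and the window length is assumed to be end - start (99 for chr1:1:100), because that convention reproduces the values shown in the next table.

```python
# Toy coverage table from the slide: window -> raw coverage per sample.
COUNTS = {
    "chr1:1:100":   [19, 24, 12, 14],
    "chr2:101:200": [25, 23, 17, 18],
    "chr1:201:300": [20, 27, 12, 18],
    "chr2:301:400": [19, 25, 16, 10],
    "chr2:401:500": [17, 27, 14, 25],
}


def window_length(window):
    """Length of a 'chrom:start:end' window (assumed to be end - start)."""
    _, start, end = window.split(":")
    return int(end) - int(start)


def total_reads(counts):
    """Column sums: total reads per sample across all windows."""
    return [sum(column) for column in zip(*counts.values())]


def normalise_by_sample(counts):
    """x[i][j] = coverage * 1e9 / (window length * total reads of sample j)."""
    totals = total_reads(counts)
    return {
        window: [c * 1e9 / (window_length(window) * totals[j])
                 for j, c in enumerate(row)]
        for window, row in counts.items()
    }


x = normalise_by_sample(COUNTS)
print(total_reads(COUNTS))        # [100, 126, 71, 85]
print(round(x["chr1:1:100"][0]))  # 1919192
```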

Normalisation by window

\(xnorm_{i,j}=\frac{x_{i,j}}{mean(X_{i})}\)

Window        Sample.1  Sample.2  Sample.3  Sample.4  Window mean
chr1:1:100     1919192   1924002   1707213   1663696      1803526
chr2:101:200   2525253   1843835   2418552   2139037      2231669
chr1:201:300   2020202   2164502   1707213   2139037      2007739
chr2:301:400   1919192   2004169   2276284   1188354      1847000
chr2:401:500   1717172   2164502   1991748   2970885      2211077

Normalised coverage

Window         Sample.1   Sample.2   Sample.3   Sample.4
chr1:1:100    1.0641334  1.0668004  0.9465976  0.9224686
chr2:101:200  1.1315532  0.8262135  1.0837411  0.9584922
chr1:201:300  1.0062077  1.0780796  0.8503163  1.0653964
chr2:301:400  1.0390862  1.0850942  1.2324225  0.6433970
chr2:401:500  0.7766223  0.9789357  0.9008047  1.3436373
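
A minimal Python sketch of the per-window step, assuming the sample-normalised values from the previous table as input. The function name is hypothetical; each value is divided by the mean of its window across the four samples, so a value of 1 means the sample has the average coverage for that window.

```python
from statistics import mean

# Sample-normalised coverage from the slide (rounded to integers there).
X = {
    "chr1:1:100":   [1919192, 1924002, 1707213, 1663696],
    "chr2:101:200": [2525253, 1843835, 2418552, 2139037],
    "chr1:201:300": [2020202, 2164502, 1707213, 2139037],
    "chr2:301:400": [1919192, 2004169, 2276284, 1188354],
    "chr2:401:500": [1717172, 2164502, 1991748, 2970885],
}


def normalise_by_window(x):
    """xnorm[i][j] = x[i][j] / mean of window i across all samples."""
    xnorm = {}
    for window, row in x.items():
        window_mean = mean(row)
        xnorm[window] = [value / window_mean for value in row]
    return xnorm


xnorm = normalise_by_window(X)
print([round(v, 4) for v in xnorm["chr1:1:100"]])
```

Because the input above is already rounded, the results agree with the table only to about six decimal places.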

Second normalisation

  • Remove the windows with \(sd(window) > 0.3\)
  • Repeat normalisation
  • Exclude the datapoints with “0”
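
The three steps above could be sketched as below. The slides do not spell out the details, so several choices here are assumptions: the standard deviation is taken on the window-normalised values with the sample estimator (n - 1), zeros are left out of the window mean and reported as None, and all function names are hypothetical.

```python
from statistics import mean, stdev


def normalise_by_window(x, skip_zeros=False):
    """Divide each window by its mean; optionally ignore zero datapoints."""
    out = {}
    for window, row in x.items():
        used = [v for v in row if v != 0] if skip_zeros else list(row)
        if not used or mean(used) == 0:
            continue  # nothing to scale against
        window_mean = mean(used)
        out[window] = [
            None if (skip_zeros and v == 0) else v / window_mean for v in row
        ]
    return out


def second_normalisation(x, sd_threshold=0.3):
    """Drop noisy windows, then normalise again without the zeros."""
    first_pass = normalise_by_window(x)
    kept = {
        window: x[window]
        for window, row in first_pass.items()
        if stdev(row) <= sd_threshold
    }
    return normalise_by_window(kept, skip_zeros=True)


# Toy input (not from the dataset): w2 is noisy, w3 has a single zero.
toy = {
    "w1": [1.0, 1.1, 0.9, 1.0],
    "w2": [0.1, 3.0, 0.2, 0.7],
    "w3": [2.0] * 19 + [0.0],
}
result = second_normalisation(toy)
print(sorted(result))  # ['w1', 'w3']
```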

Standard deviation of lines across samples.

5 out of 823 lines are noisy (\(\sigma > 0.45\)).

line SD
WATDE0009 0.75
WATDE0039 1.08
WATDE0056 0.51
WATDE0060 0.52
WATDE0090 0.50
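
A minimal sketch of how noisy lines could be flagged, assuming the SD is computed per line over its normalised coverage across all windows (the slides do not state the exact statistic). The function names and the toy line names are hypothetical; only the 0.45 threshold comes from the slide.

```python
from statistics import stdev


def line_sd(xnorm, line_names):
    """SD of normalised coverage for each line (column) across windows."""
    columns = zip(*xnorm.values())
    return {name: stdev(column) for name, column in zip(line_names, columns)}


def noisy_lines(xnorm, line_names, threshold=0.45):
    """Names of the lines whose SD exceeds the threshold."""
    sds = line_sd(xnorm, line_names)
    return [name for name, sd in sds.items() if sd > threshold]


# Toy normalised coverage: window -> value per line (not real data).
toy = {
    "w1": [1.00, 0.2],
    "w2": [1.05, 1.9],
    "w3": [0.95, 1.0],
    "w4": [1.00, 0.1],
}
flagged = noisy_lines(toy, ["LINE_A", "LINE_B"])
print(flagged)  # ['LINE_B']
```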

Regular vs noisy line

Regular line

Noisy line

Merging continuous lines

  • Individual CNV events may be informative, but we want to find large events

Stitch CNV candidates
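
The slides show the stitching only as figures, so the following is one possible sketch rather than the algorithm actually used. It assumes each window has already been labelled as a deletion, a duplication or neither (the 0.5 and 1.5 cut-offs are hypothetical), and it joins candidates of the same type on the same chromosome when the gap between them is at most max_gap bp.

```python
def call_state(xnorm, low=0.5, high=1.5):
    """Label one normalised coverage value (thresholds are hypothetical)."""
    if xnorm < low:
        return "del"
    if xnorm > high:
        return "dup"
    return None


def stitch(calls, max_gap=200):
    """Merge same-state candidates on one chromosome separated by <= max_gap bp.

    calls: iterable of (chrom, start, end, state); state None means no CNV.
    """
    events = []
    open_event = {}  # (chrom, state) -> index of the latest event in `events`
    for chrom, start, end, state in sorted(calls, key=lambda c: c[:3]):
        if state is None:
            continue
        key = (chrom, state)
        idx = open_event.get(key)
        if idx is not None and start - events[idx][2] <= max_gap:
            c, s, e, st = events[idx]
            events[idx] = (c, s, max(e, end), st)  # extend the open event
        else:
            open_event[key] = len(events)
            events.append((chrom, start, end, state))
    return events


calls = [
    ("chr1", 1, 100, "del"),
    ("chr1", 101, 200, "del"),
    ("chr1", 201, 300, None),    # one unflagged window inside the deletion
    ("chr1", 301, 400, "del"),
    ("chr1", 901, 1000, "del"),  # too far away to join the first event
    ("chr2", 1, 100, "dup"),
]
events = stitch(calls, max_gap=101)
print(events)
```

With max_gap=101 the single unflagged window is bridged, so the first three deletion candidates become one event spanning 1 to 400, while the distant candidate at 901 stays separate.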

CNVs across the full genome

Can we get an overview across a line?

CNV length distributions

There are 43,412,060 CNV events across 823 lines

CNV length distributions (over 200bp)

  • The minimum size that we have in this dataset is 200bp
  • 22,294,613 (51.36 %) are not singletons
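
The percentage can be checked directly from the two counts on the slides. The sketch below also shows how such a summary could be computed from a list of event lengths; it assumes a singleton is an event no longer than one 200bp window, and the function name and toy lengths are hypothetical.

```python
from collections import Counter


def summarise_lengths(lengths, window=200):
    """Histogram of event lengths and the percentage longer than one window."""
    histogram = Counter(lengths)
    non_singletons = sum(1 for length in lengths if length > window)
    return histogram, 100 * non_singletons / len(lengths)


# Toy event lengths in bp (not real data).
histogram, pct = summarise_lengths([200, 200, 400, 600])
print(pct)  # 50.0

# Counts reported on the slides.
total_events = 43_412_060
not_singletons = 22_294_613
print(round(100 * not_singletons / total_events, 2))  # 51.36
```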

Deletion in 4B may be incomplete

Explore global view

Deletion in full

Next steps

  • Improve stitching algorithm
  • Use smaller window size (150bp, 100bp)
  • Find genes/regions more prone to have CNVs
  • Analyse in detail known CNVs